I. Introduction


As the world population grows, it is important to ensure there is enough agricultural land to provide food for the growing population. The goal of this study is to predict the percent of land in a country designated for agricultural purposes in 2016, using data from 2015.

The data used in this study are obtained from the World Bank data sets [1]. The ‘Agricultural land (% of land area)’ indicator is the continuous response variable in the study; it accounts for arable land, land under permanent crops, and land under permanent pastures [2]. The 2016 country data is used for the response variable, since this is the most recent data available for this indicator.

The ‘Permanent cropland (% of land area)’ indicator is used as one of the predictor variables. Permanent cropland is land used for crops that last for many seasons, so replanting is not necessary. This variable may seem unsuitable for predicting the % of agricultural land in a country, since the former is used in calculating the latter. However, this study uses data from 2015 to predict data in 2016, and the 2015 permanent cropland percentage is not used to calculate the 2016 agricultural land percentage, so a very high correlation between the two is not expected by construction. This report does not study a causal relationship between the response variable in 2016 and the % of permanent cropland in 2015, because a change in the amount of agricultural land used for permanent crops does not necessarily cause a change in the total amount of land used for agricultural purposes. The study considers a linear relationship between the response variable and this predictor, since the % of agricultural land is assumed to increase in an almost linear fashion as the % of permanent cropland of a country’s land area increases.

The ‘Population growth (annual %)’ indicator for 2015 is used as another predictor variable, where the population refers to all people residing in the country, regardless of legal status or citizenship [3]. This variable is used since a relationship with the response variable is plausible: large population growth means more people need food. However, the relationship is not treated as causal, since population growth can lead to the importation of more food rather than the expansion of agricultural land. A non-linear relationship will be explored between this predictor and the response variable, since no evidence was found to support a linear relationship.

The ‘Forest area (% of land area)’ indicator for 2015 is another predictor variable. It includes land that has trees at least 5 meters tall, and excludes land with trees used in agricultural production systems, in parks, and in gardens [4]. This variable was chosen because the expansion of agricultural land tends to happen at the expense of forest land. However, the relationship is not treated as causal, since a decrease in forest land does not necessarily mean the lost land is used for agricultural purposes; the land may have been used for new buildings. A non-linear relationship will be explored between this predictor and the response variable, since no evidence was found to support a linear relationship.

The ‘Agriculture, forestry, and fishing, value added (% of GDP)’ indicator for 2015 is the final predictor variable, where the value added is the net output from the agriculture, forestry, and fishing sectors [5]. This variable is included in the study, since the amount of land designated for agricultural purposes can be influenced by the value that these sectors add to the country’s GDP. Hence, a relationship between this predictor and the response variable is expected, but it is not treated as causal, since no evidence was found to support causality. The shape of this relationship is unknown, so a nonlinear relationship will be considered.

Only the four stated predictor variables will be used in predicting the % of agricultural land of a country in 2016. Other available indicators related to agricultural land, such as the ‘Fertilizer consumption’ indicator, are after-effects of expanded agricultural land rather than factors related to its increase. Additionally, other available variables that may seem adequate, such as the ‘Arable land (% of land area)’ indicator, have been excluded due to their similarity to the ‘Permanent cropland (% of land area)’ indicator, in order to avoid potential correlation issues. These four predictor variables are assumed to be enough to build an adequate prediction model with nonlinear terms without overfitting the data, which is checked later in the study.

After organizing the data and removing countries with missing values, 225 observations remain out of the original 269 observations (about 83.6%). The sample does not have a weighted design, so no survey design will be applied.
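The cleaning step can be sketched as follows. This is a minimal illustration, assuming the joined table is a data frame with one row per country; the toy values and the names `raw` and `complete` are hypothetical and not taken from the study.

```r
# Toy stand-in for the joined World Bank table; the values are made up.
raw <- data.frame(
  Country = c("A", "B", "C", "D"),
  agricultural_land_p_2016 = c(24.6, NA, 12.3, 48.2),
  forest_area_p_2015 = c(7.6, 36.5, 68.5, NA)
)

# Keep only the rows that have no missing value in any column.
complete <- raw[complete.cases(raw), ]
nrow(complete)

# Share of observations retained in the study itself (counts from the text).
retained <- 225 / 269
round(100 * retained, 1)
```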

Table 1. Sample of 5 randomly chosen countries from the data set used in this study
Country              agricultural_land_p_2016  forest_area_p_2015  population_growth_p_2015
Israel               24.58410                  7.624769            1.9812890
Upper middle income  35.89447                  36.494328           0.8061052
Japan                12.26410                  68.460610           -0.1061250
Costa Rica           34.45946                  53.975715           1.0869528
Australia            48.24194                  16.238757           1.4392167

Country              aded_val_GDP_2015  perm_cropLand_p_2015
Israel               1.195370           4.4685766
Upper middle income  6.814909           1.3624616
Japan                1.113905           0.7982225
Costa Rica           4.956675           6.0712887
Australia            2.372644           0.0429559

II. Exploratory data analysis


With a total sample size of 225 observations, the mean and median of the % of agricultural land in a country are very close, as seen in table 2. This is consistent with a roughly symmetric, approximately normal distribution for the % of agricultural land in a country in 2016, as reaffirmed in figure 1.

Table 2: Summary for the percent of agricultural land in different countries, in 2016
n min median mean max sd
225 0.5576923 39.275 38.90696 82.55971 19.67133
Figure 1. Distribution for the percent of agricultural land in different countries, in 2016

In figures 2 and 3, we see that the distributions for the % of permanent cropland in a country in 2015 and for the % of forest area in a country in 2015 are right skewed, and the former is heavily right skewed. Figure 2 shows that most of the countries have 0% to 10% of their land designated for permanent crops. Figure 4 reaffirms this observation, and shows that the % of permanent cropland never exceeds the % of agricultural land. Since permanent cropland is a component of agricultural land, this is the expected ordering, which suggests that the four or five outliers with more than 25% of land for permanent crops are most likely not measurement errors. The blue loess curve shows some curvature between 10% and 40% permanent cropland. However, there are very few points in that area, so fitting a flexible model there may result in overfitting. Additionally, there seems to be a positive correlation between the % of permanent cropland in 2015 and the % of agricultural land in 2016, since the latter tends to increase as the former increases. The correlation coefficient was found to be 0.18, which is small, and reaffirms the absence of strong correlation between the % of permanent cropland in 2015 and the % of agricultural land in 2016, mentioned in the introduction.

Figure 4. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their 2015 Permanent Crop Land (% of land area). The red line is the best fit line. The blue curve is the Loess curve.
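The quantities behind a plot like figure 4 (the correlation coefficient, the best fit line, and the loess curve) can be computed with base R. The sketch below uses synthetic data, so the names `perm_crop` and `agri_land` and all values are assumptions for illustration only.

```r
set.seed(1)
n <- 225

# Synthetic, right-skewed predictor and a weakly related response
# (invented values, not the study data).
perm_crop <- rexp(n, rate = 1 / 4)
agri_land <- pmin(pmax(35 + 0.5 * perm_crop + rnorm(n, sd = 19), 0), 100)

r <- cor(perm_crop, agri_land)               # Pearson correlation coefficient
line_fit <- lm(agri_land ~ perm_crop)        # straight "best fit" line
smooth_fit <- loess(agri_land ~ perm_crop)   # local regression (loess) curve

# Evaluate the loess curve on a grid inside the observed range.
grid <- data.frame(perm_crop = seq(min(perm_crop), max(perm_crop),
                                   length.out = 50))
smooth_vals <- predict(smooth_fit, newdata = grid)
```

Comparing `coef(line_fit)` with the shape of `smooth_vals` is one way to judge whether a straight line is a reasonable summary in the sparse part of the range.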

In figure 5, we see a moderate negative correlation between the % of forest area in 2015 and the % of agricultural land in 2016. The correlation coefficient is about -0.44, which backs up this observation. The negative relationship shows that a large % of forest area tends to go with a small % of agricultural land, and vice versa. There is an indication of nonlinearity between 0% and 25% forest area.

Figure 5. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their percent of forest area, in 2015. The red line is the best fit line. The blue curve is the Loess curve.

Figure 6 shows the 2015 % population growth to be approximately normally distributed, with some slight right skewness. The scatter plot in figure 8 shows little to no overall correlation with the % of agricultural land in 2016, with some potential outliers on either side of the plot. However, the loess curve shows some local association in different sections of the 2015 % population growth range. Modelling this relationship nonlinearly should capture these local patterns and may reduce the influence of the outliers on the fit.

Figure 7 shows a heavily right skewed distribution for the % added value of agriculture, forestry, and fishing to a country’s GDP in 2015. In figure 9, we see that this predictor has a noticeable positive correlation with the % of agricultural land in 2016. However, this relationship may be largely influenced by the three possible outliers with an added value greater than 40%. The loess curve shows that this positive relationship is more apparent below 20% added value, after which there is not much correlation. This suggests that countries with a low % added value tend to have more agricultural land as the added value rises, but beyond a certain amount of added value the % of agricultural land is not noticeably larger.

Figure 8. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their percent annual population growth in 2015. The red line is the best fit line. The blue curve is the Loess curve.

Figure 9. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against the % added value of Agriculture, forestry, and fishing to their GDP in 2015. The red line is the best fit line. The blue curve is the Loess curve.


III. Multiple linear regression

i. Methods


It was found in the exploratory analysis that the response variable, % of agricultural land in 2016, is approximately normally distributed. So, no transformation will be applied to the variable, since there are no extreme outliers whose effects need to be reduced.

As decided at the beginning of the study, a linear relationship between the % of agricultural land in 2016 and the % of permanent cropland in 2015 will be considered, and a nonlinear one with the other predictors. To capture the nonlinear relationships, natural splines will be applied to each of the other three predictor variables: 2015 % population growth, 2015 % forest area, and 2015 % added value. Each spline will have 4 degrees of freedom (which corresponds to 5 knots: 3 interior knots plus the 2 boundary knots), following the rule of thumb for a sample size greater than 100 observations.
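The link between degrees of freedom and knots can be checked directly with `ns()` from the `splines` package. This is a small sketch on a synthetic predictor (the vector `x` is an assumption, not study data); with `df = 4` and no intercept column, `ns()` places 3 interior knots at quantiles of `x` and uses the range of `x` as the 2 boundary knots.

```r
library(splines)

set.seed(2)
x <- runif(225, min = 0, max = 100)   # synthetic predictor, not the study data

basis <- ns(x, df = 4)                # natural cubic spline basis
interior <- attr(basis, "knots")              # interior knots (quantiles of x)
boundary <- attr(basis, "Boundary.knots")     # boundary knots (range of x)

ncol(basis)        # number of basis columns, one coefficient each
length(interior)   # number of interior knots
length(boundary)   # number of boundary knots
```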

The model is:

## lm(formula = agricultural_land_p_2016 ~ perm_cropLand_p_2015 + 
##     ns(population_growth_p_2015, df = 4) + ns(forest_area_p_2015, 
##     df = 4) + ns(aded_val_GDP_2015, df = 4), data = tidy_joined_dataset)
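For readers who want to reproduce the workflow, the sketch below fits the same formula to synthetic data and extracts the quantities used in the diagnostics. The data frame is generated with invented values; only the column names and the formula come from this report.

```r
library(splines)

set.seed(3)
n <- 225

# Synthetic data using the report's column names; all values are invented.
tidy_joined_dataset <- data.frame(
  perm_cropLand_p_2015     = rexp(n, rate = 1 / 4),
  population_growth_p_2015 = rnorm(n, mean = 1.3, sd = 1.1),
  forest_area_p_2015       = runif(n, min = 0, max = 95),
  aded_val_GDP_2015        = rexp(n, rate = 1 / 10)
)
tidy_joined_dataset$agricultural_land_p_2016 <- with(
  tidy_joined_dataset,
  45 + 0.5 * perm_cropLand_p_2015 - 0.3 * forest_area_p_2015 +
    6 * log1p(aded_val_GDP_2015) + rnorm(n, sd = 15)
)

fit <- lm(agricultural_land_p_2016 ~ perm_cropLand_p_2015 +
            ns(population_growth_p_2015, df = 4) +
            ns(forest_area_p_2015, df = 4) +
            ns(aded_val_GDP_2015, df = 4),
          data = tidy_joined_dataset)

res <- residuals(fit)                 # residuals for the histogram and scatterplot
qq <- qqnorm(res, plot.it = FALSE)    # coordinates of a normal Q-Q plot
n_coef <- length(coef(fit))           # intercept plus 1 + 4 + 4 + 4 terms
resid_df <- df.residual(fit)          # sample size minus number of coefficients
```

Plotting `fitted(fit)` against `res` gives the residual scatterplot used to look for heteroscedasticity.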

Using the model above, we get the normal Q-Q plot in figure 10, which appears approximately straight, except for some slight trailing-off at the tails. Figure 11 shows the residuals of the model to have an approximately normal distribution. Figure 12 shows a cloud-shaped scatterplot for the residuals. In the latter graph, there appear to be possible outliers on the extreme right and left sides of the fitted values range. However, their influence does not seem to cause heteroscedasticity.

Figure 10. Normal Q-Q plot for the percent of agricultural land in different countries, in 2016

In table 3, we see that the GVIF value for the variable with 1 degree of freedom and the GVIF^(1/(2*Df)) values for the variables with more than 1 degree of freedom are all between 1 and 5; in fact, every GVIF^(1/(2*Df)) value is below 1.12. This indicates only weak correlation among the predictor variables. Since there is little multicollinearity, the statistical power of the model is not greatly reduced.

Additionally, table 3 shows that the model has a total of 13 degrees of freedom. For a study with a continuous response variable and a sample size of 225 observations, we have \(\frac{225}{13}\) = 17.3 which is greater than 15 (as a rule of thumb). Thus, the model has a reasonable number of predictors for the given sample, which may reduce the risk of overfitting.

Since the model met the assumptions of linear regression, we can proceed with the analysis.

Table 3: VIF table
                                      GVIF      Df  GVIF^(1/(2*Df))
perm_cropLand_p_2015                  1.124085  1   1.060229
ns(population_growth_p_2015, df = 4)  2.421891  4   1.116913
ns(forest_area_p_2015, df = 4)        1.490688  4   1.051171
ns(aded_val_GDP_2015, df = 4)         2.007460  4   1.091015
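The last column of table 3 follows from the first two. As a quick check, the sketch below recomputes GVIF^(1/(2*Df)) from the reported GVIF and Df values; the vector names are hypothetical, and the numbers are copied from the table.

```r
# GVIF and Df values copied from table 3.
gvif <- c(perm_cropLand     = 1.124085,
          population_growth = 2.421891,
          forest_area       = 1.490688,
          aded_val_GDP      = 2.007460)
df <- c(1, 4, 4, 4)

# Adjusting by 1 / (2 * Df) makes terms with different Df comparable.
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)
```

Squaring the adjusted value puts it on the scale of an ordinary VIF, which is why the usual VIF thresholds are often applied to the squared quantity.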

ii. Model Results and Interpretation


The model is:

## lm(formula = agricultural_land_p_2016 ~ perm_cropLand_p_2015 + 
##     ns(population_growth_p_2015, df = 4) + ns(forest_area_p_2015, 
##     df = 4) + ns(aded_val_GDP_2015, df = 4), data = tidy_joined_dataset)

Because splines are used for the 2015 % population growth, 2015 % forest area, and 2015 % added value of agriculture, forestry, and fishing to a country’s GDP, their individual coefficients in table 4 cannot be interpreted directly. The p-values for the 4th basis term of the 2015 % population growth spline, the 2nd and 4th basis terms of the 2015 % forest area spline, and the 1st and 3rd basis terms of the 2015 % added value spline are less than 0.05. However, the basis terms of a spline work together to describe a single curve, so the contribution of each spline to predicting the 2016 % of agricultural land in a country is better judged as a whole, as is done with the type II tests in table 5.

As for the 2015 % of permanent cropland coefficient, for every additional percentage point of permanent cropland in 2015, the 2016 % of agricultural land in the country tends to increase by 0.4914 percentage points, when all the other variables in the model (2015 % population growth, 2015 % forest area, and 2015 % added value) are held constant. The p-value for this coefficient is less than 0.05, making it significant in the model for predicting the 2016 % of agricultural land in a country.

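The intercept is the expected 2016 % of agricultural land for a country with no permanent cropland in 2015 and with each spline term at its baseline. With R’s default ns() basis, that baseline is the smallest observed value of the predictor (the lower boundary knot) rather than zero. The intercept of 24.1215% therefore describes a country with no permanent cropland in 2015 and with the smallest observed 2015 population growth, forest area, and added value from agriculture, forestry, and fishing to GDP. The p-value for the intercept is less than 0.05, making it significant.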
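One caveat when reading an intercept next to spline terms is where the spline basis equals zero. The sketch below, on a synthetic predictor (`x` is an assumption, not study data), evaluates a default `ns()` basis at the smallest observed value and at 0.

```r
library(splines)

set.seed(4)
x <- runif(225, min = 5, max = 95)    # synthetic predictor, not the study data

basis <- ns(x, df = 4)                # default: no intercept column

at_min  <- predict(basis, min(x))     # basis row at the lower boundary knot
at_zero <- predict(basis, 0)          # basis row at 0, outside the observed range

max(abs(at_min))    # all basis columns vanish at the lower boundary knot
max(abs(at_zero))   # but not at 0, where the basis is extrapolated linearly
```

In this sketch the spline contributes nothing to the prediction at `min(x)`, so the intercept of a model built on such a basis refers to that point rather than to a predictor value of zero.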

Table 4. Model Summary Table
                                       Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                            24.1215   10.3841     2.3229   0.0211
perm_cropLand_p_2015                   0.4914    0.1434      3.4269   0.0007
ns(population_growth_p_2015, df = 4)1  -9.6982   8.5759      -1.1309  0.2594
ns(population_growth_p_2015, df = 4)2  10.3611   7.9389      1.3051   0.1933
ns(population_growth_p_2015, df = 4)3  -20.9933  19.8490     -1.0576  0.2914
ns(population_growth_p_2015, df = 4)4  -42.8759  11.5787     -3.7030  0.0003
ns(forest_area_p_2015, df = 4)1        -0.5732   4.3938      -0.1305  0.8963
ns(forest_area_p_2015, df = 4)2        -19.1007  5.0858      -3.7557  0.0002
ns(forest_area_p_2015, df = 4)3        2.6878    9.8495      0.2729   0.7852
ns(forest_area_p_2015, df = 4)4        -49.7971  7.6792      -6.4847  0.0000
ns(aded_val_GDP_2015, df = 4)1         15.5354   4.4557      3.4866   0.0006
ns(aded_val_GDP_2015, df = 4)2         7.3391    6.7614      1.0854   0.2790
ns(aded_val_GDP_2015, df = 4)3         42.0494   10.6767     3.9384   0.0001
ns(aded_val_GDP_2015, df = 4)4         13.8894   11.0561     1.2563   0.2104

The model explains about 45.8% of the variability in the 2016 % of agricultural land in a country (adjusted R-squared of about 42.4%), as given below. With a very small p-value (<< 0.05) for the F-statistic, the model fits significantly better than an intercept-only model, so we conclude that the previously stated model is an adequate model for predicting the 2016 % of agricultural land in a country.

                         Value   df
Residual Standard Error  14.926  211
Multiple R-squared       0.458
Adjusted R-squared       0.424

                   Value      Numerator df  Denominator df
Model F-statistic  13.7       13            211
P-value            7.366e-22
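These summary figures are tied together by standard formulas, so they can be cross-checked from numbers reported elsewhere in this report. The sketch below uses the residual sum of squares from table 5 and the standard deviation of the response from table 2; the variable names are hypothetical.

```r
n <- 225          # observations
p <- 13           # model degrees of freedom (excluding the intercept)
rss <- 47010.294  # residual sum of squares, from table 5
sd_y <- 19.67133  # standard deviation of the response, from table 2

tss <- (n - 1) * sd_y^2                           # total sum of squares
r2 <- 1 - rss / tss                               # multiple R-squared
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)    # adjusted R-squared
rse <- sqrt(rss / (n - p - 1))                    # residual standard error
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))     # overall F-statistic
p_val <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

round(c(r2 = r2, adj_r2 = adj_r2, rse = rse, f_stat = f_stat), 3)
```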

iii. Inference for multiple regression

Looking at the type II sums of squares in table 5, we see that the 2015 % permanent cropland explains a significant amount of variation in the 2016 % of agricultural land in a country when the splines of the other 3 predictors are already in the model, since it has a p-value less than 0.05 (p = 0.0007).

The spline for the 2015 % population growth variable explains a significant amount of variation in the 2016 % of agricultural land in a country when the 2015 % permanent cropland variable and the splines for the 2015 % forest area and 2015 % added value variables are already in the model (p = 0.0062).

The spline for the 2015 % forest area variable explains a significant amount of variation in the 2016 % of agricultural land in a country when the 2015 % permanent cropland variable and the splines for the 2015 % population growth and 2015 % added value variables are already in the model (p < 0.0001).

The spline for the 2015 % added value of agriculture, forestry, and fishing to a country’s GDP explains a significant amount of variation in the 2016 % of agricultural land in a country when the 2015 % permanent cropland variable and the splines for the 2015 % population growth and 2015 % forest area variables are already in the model (p = 0.0008).

Table 5. ANOVA (Type II tests) Table
                                      Sum Sq     Df   F value  Pr(>F)
perm_cropLand_p_2015                  2616.434   1    11.7435  0.0007
ns(population_growth_p_2015, df = 4)  3295.916   4    3.6983   0.0062
ns(forest_area_p_2015, df = 4)        27741.579  4    31.1287  0.0000
ns(aded_val_GDP_2015, df = 4)         4420.418   4    4.9601   0.0008
Residuals                             47010.294  211  NA       NA
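For a model without interactions, a type II test of one term amounts to comparing the full model with the model that drops only that term. The sketch below illustrates this with a reduced two-predictor example on synthetic data; the names `perm`, `forest`, and `agri` are hypothetical stand-ins, not the study variables.

```r
library(splines)

set.seed(5)
n <- 225

# Synthetic stand-ins for two predictors and the response (invented values).
d <- data.frame(perm = rexp(n, rate = 1 / 4),
                forest = runif(n, min = 0, max = 95))
d$agri <- 50 + 0.5 * d$perm - 0.004 * d$forest^2 + rnorm(n, sd = 15)

full    <- lm(agri ~ perm + ns(forest, df = 4), data = d)
reduced <- lm(agri ~ perm, data = d)       # same model without the spline

cmp <- anova(reduced, full)                # F test for adding the spline

# The same F value by hand: extra sum of squares per df over residual variance.
ss_term <- deviance(reduced) - deviance(full)
f_hand <- (ss_term / 4) / (deviance(full) / df.residual(full))
```

The rows of table 5 have the same structure: each F value is the term’s sum of squares divided by its Df, over the residual sum of squares divided by 211.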

When the 2015 % population growth, 2015 % forest area, and 2015 % added value of agriculture, forestry, and fishing to a country’s GDP variables are constant (equal to the values in the figure 13 caption), we see that the blue 95% confidence intervals for the 2016 % of agricultural land in a country are narrow at 0% permanent cropland in 2015 and get wider after 20%. This shows that the estimate is more precise in the area with more points, and less precise in the area with fewer points, which may be influential outliers. The pink 95% prediction intervals are all greater than zero, but the ones before 20% permanent cropland in 2015 are slightly narrower than those after the 20% mark.

Figure 13. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their 2015 Permanent Crop Land (% of land area), where the forest area percentage of a country’s land in 2015 equals its median = 31.88387, for a median percent population growth in a country in 2015 = 1.256186, and for a median % added value of agriculture, forestry, and fishing to the GDP of a country = 7.408892. The blue line is the fitted line, with its associated 95% CI and wider pink 95% PI.
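The difference between the two kinds of interval can be sketched with `predict()` on a fitted `lm` object. The example uses a single synthetic predictor (`perm` and `agri` are hypothetical names with invented values) rather than the full model.

```r
set.seed(6)
n <- 225

# Synthetic, right-skewed predictor: most points are near 0, few are large.
d <- data.frame(perm = rexp(n, rate = 1 / 4))
d$agri <- 35 + 0.5 * d$perm + rnorm(n, sd = 15)

fit <- lm(agri ~ perm, data = d)
grid <- data.frame(perm = c(0, 10, 20, 40))

# Confidence interval: uncertainty about the mean response at each grid point.
conf_int <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
# Prediction interval: uncertainty about a single new observation.
pred_int <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)

ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]
cbind(grid, ci_width, pi_width)
```

The confidence interval widens as the grid point moves away from the bulk of the data, while the prediction interval is dominated by the residual variance and changes much less.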

When the 2015 % permanent cropland, 2015 % forest area, and 2015 % added value of agriculture, forestry, and fishing to a country’s GDP variables are constant (equal to the values in the figure 14 caption), we see that the blue 95% confidence intervals for the 2016 % of agricultural land in a country are wide at the edges of the 2015 % population growth range, indicating less precise estimates there, whereas the intervals are narrower in the center (between 0% and 3% 2015 population growth), where the majority of the observations are located. The pink 95% prediction intervals seem to be of almost equal width across the range of the 2015 % population growth spline. The 2015 % population growth natural spline appears to have angular areas in the curve, rather than a smooth flowing curve. A different spline, or an adjustment to the natural spline, might smooth out the curve.

Figure 14. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their percent annual population growth in 2015, where the forest area percentage of a country’s land in 2015 equals its median = 31.88387, for a median % added value of agriculture, forestry, and fishing to the GDP of a country = 7.408892, and for a median percent of permanent crop land of a country’s land area = 1.311853. The blue line is the natural spline, with its associated 95% CI and wider pink 95% PI.

When the 2015 % permanent cropland, 2015 % population growth, and 2015 % added value of agriculture, forestry, and fishing to a country’s GDP variables are constant (equal to the values in the figure 15 caption), we see that the blue 95% confidence intervals for the 2016 % of agricultural land in a country are narrow at 0% forest area in 2015 and get wider after 75%. This shows that the estimate is more precise in the area with more points, and less precise in the area with fewer points. The pink 95% prediction intervals seem to be of almost equal width across the range of the 2015 % forest area.

Figure 15. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against their percent of forest area, in 2015, for a median percent population growth in a country in 2015 = 1.256186, for a median % added value of agriculture, forestry, and fishing to the GDP of a country = 7.408892, and for a median percent of permanent crop land of a country’s land area = 1.311853. The blue line is the natural spline, with its associated 95% CI and wider pink 95% PI.

When the 2015 % permanent cropland, 2015 % population growth, and 2015 % forest area variables are constant (equal to the values in the figure 16 caption), we see that the blue 95% confidence intervals for the 2016 % of agricultural land in a country are narrow at 0% added value of agriculture, forestry, and fishing to a country’s 2015 GDP, and get wider after 20%. This shows that the estimate is more precise before 20% added value, and less precise after 20% added value. The pink 95% prediction intervals seem to be of almost equal width across the range of the % added value of agriculture, forestry, and fishing to a country’s 2015 GDP.

Figure 16. Interactive Scatterplot for the percent of agricultural land in different countries, in 2016 against the % added value of Agriculture, forestry, and fishing to their GDP in 2015, where the forest area percentage of a country’s land in 2015 equals its median = 31.88387, for a median percent population growth in a country in 2015 = 1.256186, and for a median percent of permanent crop land of a country’s land area = 1.311853. The blue line is the natural spline, with its associated 95% CI and wider pink 95% PI.

Figures 14, 15, and 16 show that some 95% prediction intervals at the edges contain negative numbers. This is problematic, since a country cannot have a negative percentage of agricultural land. To solve this issue, we could apply a log() transformation to the response variable. However, a log transformation of the response variable skews the residuals of the model and makes the interpretation more complicated than it has to be. Earlier in this study, it was established that the linear regression assumptions were met and that the residual distribution and scatter plot did not show any extreme outliers, so the simpler approach of not transforming the response variable was taken, and that decision was carried throughout the study.
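The trade-off can be illustrated with a small sketch. It compares a prediction interval from an untransformed fit with a back-transformed interval from a log-scale fit, using one synthetic predictor; the names `forest` and `agri` and all values are assumptions, and this is not the model fitted in the study.

```r
set.seed(7)
n <- 225

# Synthetic data in which the response approaches zero at high forest cover.
d <- data.frame(forest = runif(n, min = 0, max = 95))
d$agri <- pmax(60 - 0.6 * d$forest + rnorm(n, sd = 12), 0.5)

raw_fit <- lm(agri ~ forest, data = d)        # untransformed response
log_fit <- lm(log(agri) ~ forest, data = d)   # log-transformed response

new <- data.frame(forest = 95)                # edge of the predictor range
raw_pi <- predict(raw_fit, newdata = new, interval = "prediction")
# Back-transforming the log-scale interval keeps both bounds positive.
log_pi <- exp(predict(log_fit, newdata = new, interval = "prediction"))

rbind(untransformed = raw_pi[1, ], back_transformed = log_pi[1, ])
```

The back-transformed bounds are always positive, but the interval is no longer symmetric, and the residuals of the log-scale fit need their own diagnostic checks.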

IV. Discussion

i. Conclusions

Overall, the model is adequate for predicting the % of agricultural land in 2016 for a country, and it explains about 45.8% of the variability in the % of agricultural land in 2016 for a country (adjusted R-squared of 42.4%). The addition of the 2015 % permanent cropland predictor, when the other splines are in the model, explains a significant amount of variation in the % of agricultural land in 2016 for a country. The 2015 % permanent cropland appears to have a positive correlation with the response variable. Similarly, adding each of the splines for the 2015 % population growth, the 2015 % forest area, and the 2015 % added value of agriculture, forestry, and fishing to a country’s GDP explains a significant amount of variation in the % of agricultural land in 2016 for a country, when the other variables (other than the one being added) are already in the model.

The significance of these predictors is not unexpected, since they were chosen for their relationship with the 2016 % of agricultural land, as stated in the introduction.

ii. Limitations

There were not many options for desirable indicators to choose from. That resulted in a small number of predictors, making dimension reduction techniques such as PCA and clustering unnecessary. Most of the available variables were not specific enough to be related to the goal of the study: to predict the % of agricultural land in a country in 2016.

When countries are the observations, the sample size cannot grow beyond a few hundred. This greatly limits the flexibility of our model, if we want to avoid overfitting by keeping the model’s degrees of freedom less than \(\frac{\text{sample size}}{15}\). Using splines was also limiting, since it made the model harder to interpret. However, giving up some interpretability was a reasonable tradeoff for a better predictive model.

iii. Further questions

There were some potential extreme outliers in the study, but not enough information is available to validate the accuracy of these outliers (i.e. to check for measurement errors). A sensitivity analysis can be conducted to check for the effect of these outliers on the model.
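One possible form of such a sensitivity analysis is to flag influential observations with Cook’s distance and refit without them. The sketch below uses synthetic data and a cutoff of 4/n, which is a common rule of thumb and an assumption here, not a choice made in this study.

```r
set.seed(8)
n <- 225

# Synthetic, right-skewed predictor with a few large values (invented data).
d <- data.frame(gdp = rexp(n, rate = 1 / 10))
d$agri <- 30 + 0.4 * d$gdp + rnorm(n, sd = 15)

fit_all <- lm(agri ~ gdp, data = d)

cd <- cooks.distance(fit_all)
flagged <- which(cd > 4 / n)              # rule-of-thumb cutoff (assumption)
keep <- setdiff(seq_len(n), flagged)

fit_trim <- lm(agri ~ gdp, data = d[keep, ])

# Side-by-side coefficients: large shifts suggest the outliers drive the fit.
rbind(all = coef(fit_all), trimmed = coef(fit_trim))
```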

The model obtained from this study was neither validated nor tested, so this is something that can be done in a future study.
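A future validation could take the form of k-fold cross-validation. The sketch below is one way to set it up in base R with a single spline term on synthetic data; the choice of 5 folds and the names `forest`, `agri`, and `fold` are assumptions for illustration.

```r
library(splines)

set.seed(9)
n <- 225

# Synthetic data with a curved relationship (invented values).
d <- data.frame(forest = runif(n, min = 0, max = 95))
d$agri <- 60 - 0.005 * d$forest^2 + rnorm(n, sd = 15)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels

# For each fold: fit on the other folds, then score on the held-out fold.
rmse <- sapply(seq_len(k), function(i) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit <- lm(agri ~ ns(forest, df = 4), data = train)
  sqrt(mean((test$agri - predict(fit, newdata = test))^2))
})

mean(rmse)   # average out-of-sample root mean squared error
```

Comparing this out-of-sample error with the residual standard error of the model fitted to all the data gives a rough indication of overfitting.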


V. Citations and References


  1. “Indicators”. The World Bank - Data. Accessed December 2020. https://data.worldbank.org/indicator

  2. “AG.LND.AGRI.K2”. Metadata Glossary, Data Bank. Accessed December 2020. https://databank.worldbank.org/metadataglossary/world-development-indicators/series/AG.LND.AGRI.K2#:~:text=Agricultural%20land%20refers%20to%20the,crops%2C%20and%20under%20permanent%20pastures.

  3. “Population Growth (Annual %)”. Data Catalog, The World Bank. Accessed December 2020. https://datacatalog.worldbank.org/population-growth-annual

  4. “Forest Area (% of Land Area)”. Data Catalog, The World Bank. Accessed December 2020. https://datacatalog.worldbank.org/forest-area-land-area-3

  5. “Agriculture, forestry, and fishing, value added (% of GDP)”. Data Catalog, The World Bank. Accessed December 2020. https://datacatalog.worldbank.org/agriculture-forestry-and-fishing-value-added-gdp-0